A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors

Authors

Kai Zhang

Shuming Chen

Wei Liu

Xi Ning

Abstract

LU decomposition is a widely used method for solving dense linear algebra problems in many scientific computing applications. In recent years, single instruction multiple data (SIMD) technology has become a popular way to accelerate LU decomposition. However, pipeline parallelism and memory bandwidth utilization are low when LU decomposition is mapped onto SIMD processors. This paper proposes a fine-grained pipelined implementation of LU decomposition on SIMD processors. The fine-grained algorithm exploits the data dependences of the native algorithm to expose fine-grained parallelism across all the computation resources. By transforming non-coalesced memory accesses into coalesced ones, the proposed algorithm achieves high pipeline parallelism and highly efficient memory access. Experimental results show that the proposed technique achieves a speedup of 1.04x to 1.82x over the native algorithm and reaches about 89% of the peak performance of the SIMD processor.
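For orientation, the baseline computation that the abstract refers to can be sketched as a plain Doolittle LU factorization without pivoting. This is only an illustrative sketch, not the paper's pipelined algorithm: the function names `lu_decompose` and `matmul` are hypothetical, and the absence of pivoting is an assumption made to keep the example short. The innermost loop walks along one row at consecutive column indices, which is the kind of contiguous access pattern that SIMD lanes can process together.

```python
def lu_decompose(a):
    """Factor a square matrix a into (L, U) with L unit lower triangular.

    Illustrative sketch only: Doolittle elimination without pivoting,
    so it raises an error if a zero pivot is encountered.
    """
    n = len(a)
    u = [row[:] for row in a]  # working copy that becomes U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        pivot = u[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):
            factor = u[i][k] / pivot
            l[i][k] = factor
            u[i][k] = 0.0
            # Row update over consecutive columns: the part a SIMD unit
            # would typically vectorize.
            for j in range(k + 1, n):
                u[i][j] -= factor * u[k][j]
    return l, u


def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    a = [[4.0, 3.0], [6.0, 3.0]]
    l, u = lu_decompose(a)
    print(l)
    print(u)
    print(matmul(l, u) == a)
```

Each elimination step `k` depends on the row produced by step `k - 1`, which is the data dependence that limits how much of the work can overlap in a pipeline.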

Similar resources

A Multiprocessor Architecture Combining Fine-Grained and Coarse-Grained Parallelism Strategies

A wide variety of computer architectures have been proposed that attempt to exploit parallelism at different granularities. For example, pipelined processors and multiple instruction issue processors exploit the fine-grained parallelism available at the machine instruction level, while shared memory multiprocessors exploit the coarse-grained parallelism available at the loop level. Using a regi...

Efficient Exploitation of Parallelism on Pentium III and Pentium 4 Processor-Based Systems

Systems based on the Pentium III and Pentium 4 processors enable the exploitation of parallelism at a fine- and medium-grained level. Dual- and quad-processor systems, for example, enable the exploitation of medium-grained parallelism by using multithreaded code that takes advantage of multiple control and arithmetic logic units. Streaming Single-Instruction-Multiple-Data (SIMD) extensions, on the o...

A Domain-Specific Architecture for Elementary Function Evaluation

We propose a Domain-Specific Architecture for elementary function computation to improve throughput while reducing power consumption as a model for more general applications: support fine-grained parallelism by eliminating branches, eliminate the duplication required by co-processors by decomposing computation into instructions which fit existing pipelined execution models and standard register...

Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead

Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the fork-join model of parallel execution, which may result in suboptimal performance on current and future generations of multi-core processors. To overcome the shortcomings of this approach a pipelined model of parallel execution is presented, and the idea of look ahe...

Fast parallel solver for the levelset equations on unstructured meshes

The levelset method is a numerical technique that tracks the evolution of curves and surfaces governed by a nonlinear partial differential equation (levelset equation). It has applications within various research areas such as physics, chemistry, fluid mechanics, computer vision, and microchip fabrication. Applying the levelset method entails solving a set of nonlinear partial differential equa...

Publication date: 2013
